Abstract

When dealing with subjective, noisy, or otherwise nebu- 
lous features, the "wisdom of crowds" suggests that one 
may benefit from multiple judgments of the same feature 
on the same object. We give theoretically-motivated fea- 
ture multi-selection algorithms that choose, among a large 
set of candidate features, not only which features to judge 
but how many times to judge each one. We demonstrate 
the effectiveness of this approach for linear regression on 
a crowdsourced learning task of predicting people's height 
and weight from photos, using features such as gender and 
estimated weight as well as culturally fraught ones such as 
attractive. This work has been published in Sabato & Kalai 
(2013). 

1. Introduction 

In this paper we consider prediction with subjective, 
vague, or noisy attributes (which are also termed 'features' 
throughout this paper). Such attributes can sometimes be 
useful for prediction, because they account for an impor- 
tant part of the signal that cannot be otherwise captured. In 
a crowdsourcing setting, the "wisdom of crowds" suggests 
that including multiple assessments of the same feature by 
different people may be useful. Henceforth, we refer to as- 
sessments of features as judgments. This paper introduces 
the problem of selecting, from a set of candidate features, 
which ones to use for prediction, and how many judgments 
to acquire for each, for a given budget limiting the total 
number of judgments. We give theoretically justified al- 
gorithms for this problem, and report crowdsourced exper- 
imental results, in which judgments of highly subjective 
features (even culturally fraught ones such as attractive) 
are helpful for prediction. 

As a toy example, consider the problem of estimating the 
number of jelly beans in a jar based on an image of the 
jar. A linear regressor with multiple judgments of features 
might have the form, 

$$\hat{y} = 0.95\,(\text{est. number of beans})^{/5} - 50\,(\text{round jar})^{/2} + 100\,(\text{monochromatic})^{/1} + 30\,(\text{beautiful})^{/3}.$$

Here, for binary attributes, $a^{/r_a} \in [0, 1]$ denotes the
fraction of positive judgments out of $r_a$ judgments of attribute
a. For real-valued attributes, $a^{/r_a}$ denotes the mean of $r_a$
judgments. The shape, number of colors, and attractiveness 
of the jar each help correct biases in the estimated number 
of beans, averaged across five people. Our goal is to choose 
a regressor that, as accurately as possible, estimates the la- 
bels (i.e., jelly bean counts) on future objects (i.e., jars) 
drawn from the same distribution, while staying within a 
budget of feature judgment resources per evaluated object 
at test time. In the example above, notice that even though 
the monochromatic coefficient is greater than the beautiful 
coefficient, fewer monochromatic judgments are used, be- 
cause counting the number of colors is more objective, and 
hence further judgments are less valuable. While this ex- 
ample is contrived, similar phenomena are observed in the 
output of our algorithms (see Table 2).
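To make the notation concrete, the following Python sketch evaluates the toy regressor on a single jar. The coefficients and repeat counts are taken from the example above; the attribute keys and all judgment values are invented purely for illustration.

```python
def mean(values):
    """Average of a non-empty list of judgments."""
    return sum(values) / len(values)

# (coefficient, number of judgments) per attribute, as in the toy example.
TOY_MODEL = {
    "est_number_of_beans": (0.95, 5),
    "round_jar": (-50.0, 2),
    "monochromatic": (100.0, 1),
    "beautiful": (30.0, 3),
}

def predict_beans(judgments):
    """Sum of coefficient * mean judgment over all attributes.

    Binary judgments are encoded as 0/1, so their mean is the
    fraction of positive judgments.
    """
    total = 0.0
    for attribute, (coefficient, repeats) in TOY_MODEL.items():
        values = judgments[attribute]
        if len(values) != repeats:
            raise ValueError(f"{attribute} needs exactly {repeats} judgments")
        total += coefficient * mean(values)
    return total

# Hypothetical judgments for one jar (made-up numbers).
jar = {
    "est_number_of_beans": [400, 450, 500, 550, 600],
    "round_jar": [1, 0],
    "monochromatic": [0],
    "beautiful": [1, 1, 0],
}
prediction = predict_beans(jar)
```

The budget spent on this jar is 5 + 2 + 1 + 3 = 11 judgments, which is the quantity the feature multi-selection problem constrains.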

We refer to the problem of selecting the number of repetitions,
$r_a$, of each attribute, as the feature multi-selection
problem, because it generalizes the feature selection problem
of choosing a subset of features, i.e., $r_a \in \{0, 1\}$,
to choosing a multiset of features, i.e., $r_a \in \mathbb{N}$. Since
the feature selection problem is well known to be NP-hard 
(Natarajan, 1995), our problem is also NP-hard in the gen- 
eral case. (For a formal reduction, one simply considers the 
"objective" case where all judgments of the same feature- 
object pair are identical.) Nonetheless, several success- 
ful approaches have been proposed for feature selection. 
The algorithms that we propose generalize two of these ap- 
proaches to the problem of feature multi-selection. 

Our algorithms are theoretically motivated, and tested 
on synthetic and real-world data. The real world 
data are photos extracted from the publicly available 
Photographic Height/Weight Chart (footnote 1), where people post
pictures of themselves announcing their own height and 
weight. 

1 http://www.cockeyed.com/photos/bodies/heightweight.html

As a more general motivation, consider a scientist who
would like to use crowdsourcing as an alternative to
themselves estimating a value for each of a large data set of
objects. Say the scientist gathers multiple judgments of
a number of binary or real-valued attributes for each ob- 
ject, and uses linear regression to predict the value of in- 
terest. In some cases, crowdsourcing is a natural source 
of judgments, as a great number of them may be acquired 
on demand, rapidly, and at very low cost. We assume the 
scientist has access to the following information: 

• A labeled set of objects $(o, y) \in \mathcal{O} \times \mathcal{Y}$ (with no
judgments), where $\mathcal{O}$ is a set of objects and $\mathcal{Y} \subseteq \mathbb{R}$ is
a set of ground-truth labels, drawn independently from
a distribution $\mathcal{D}$.

• A crowd, which is a large pool of workers. 

• A possibly large set of candidate attributes A. For
any attribute $a \in A$ and object $o \in \mathcal{O}$, the judgment
of a random worker from the crowd may be queried at
a cost.

• A budget B, limiting the number of attribute judg- 
ments to be used when evaluating the regressor on a 
new unseen object. 

Our approach is as follows: 

1. Collect $k \ge 2$ judgments for each candidate attribute
in A, for each object in the labeled set. 

2. Based on this data and the budget, decide how many 
judgments of each attribute to use in the regressor. 

3. Collect additional judgments (as needed) on the la- 
beled set so that each attribute has the number of judg- 
ments specified in the previous step. 

4. Find a linear predictor based on the average judgment 
of each feature (footnote 2).

Step 4 can be accomplished by simple least-squares regres- 
sion. The goal in Step 2 (feature multi-selection) is to de- 
cide on a number of judgments per attribute that will hope- 
fully yield the smallest squared error after Step 4. 

2 We focus on mean averaging, leaving to future work other
aggregation statistics such as the median.

Interestingly, even given as few as k = 2 judgments per
attribute, one can project an estimate of the squared error
with more than k judgments of some features. We prove
that these projections are accurate, for any fixed $k \ge 2$, as
the number of labeled objects increases. Our algorithms
perform a greedy strategy for feature multi-selection, to
attempt to minimize the projected loss. This greedy strategy
can be seen as a generalization of the Forward Regression
approach for standard feature selection (see e.g. Miller,
2002). The first algorithm operates under the assumption
that different attributes are uncorrelated. In this case the
projection simplifies to a simple scoring rule, which incor- 
porates attribute-label correlations as well as a natural no- 
tion of inter-rater reliability for each attribute. In this case, 
greedy selection is also provably optimal. While attributes 
are highly correlated in practice, the algorithm performs 
well in our experiments, possibly because Step 4 corrects 
for a small number of poor choices during feature multi- 
selection. The second algorithm attempts to optimize the 
projection without any assumptions on the nature of corre- 
lations between features. 

While crowdsourcing is one motivation, the algorithms 
would be applicable to other settings such as learning from 
noisy sensor inputs, where one may place multiple sensors 
measuring each quantity, or social science experiments, 
where one may have multiple research assistants (rather 
than a crowd) judging each attribute. 

The main contributions of this paper are: (a) introducing 
the feature multi-selection problem, (b) giving theoreti- 
cally justified feature multi-selection algorithms, and (c) 
presenting experimental results, showing that feature multi- 
selection can yield more accurate regressors, with different 
numbers of judgments for different attributes. 

Related Work 

Related work spans a number of fields, including Statistics, 
Machine Learning, Crowdsourcing, and measurement in 
the social sciences. A number of researchers have studied 
attribute-efficient prediction (also called budgeted learn- 
ing) assuming, as we do, that there is a cost to evaluating 
attributes and one would like to evaluate as few as possible 
(see, for instance, the recent work by Cesa-Bianchi et al. 
(2011) and references therein). In that line of work, each 
attribute is judged at most once. The errors-in-variables 
approach (e.g., Cheng & Van Ness, 1999) in statistics esti- 
mates the 'true' regression coefficients using noisy feature 
measurements. This approach is less suitable in our setting, 
since our final goal is to predict from noisy measurements. 

A wide variety of techniques have been studied to combine 
estimates of experts or the crowd of a single quantity of in- 
terest (see, e.g. Dawid & Skene, 1979; Smyth et al., 1994; 
Welinder et al., 2010), like estimating the number of jelly 
beans in a jar from a number of guesses.

Two recent works on crowdsourcing are very relevant. 
Patterson & Hays (2012) crowdsourced the mean of 3 judg- 
ments of each of 102 binary attributes on over 14,000 
images, yielding over 4 million judgments. Some of 
their attributes are subjective, e.g., soothing. We employ 
their crowdsourcing protocol to label our binary attributes. 
Isola et al. (2011) study subjective and objective features 
for the task of estimating how memorable an image is, by
taking the mean of 10 judgments per attribute for each
image. They perform greedy feature selection over these
attributes to find the best compact set of attributes for
predicting memorability. The key difference between their
algorithm and ours is that theirs does not choose how many
judgments to average. Since that quantity is fixed for each 
attribute, their setting falls under the more standard feature 
selection umbrella. In our experiments we compare this 
approach to our algorithms. 

Finally, in the social sciences, a wide array of techniques 
have been developed for assessing inter-rater reliability of 
attributes, with the most popular perhaps being the $\alpha$
coefficient (Cronbach, 1951). A principal use of such measures
is determining, by some threshold, which features may be 
used in content analysis. For an overview of reliability the- 
ory, see (Krippendorff, 2012). 

2. Preliminary assumptions and definitions 

Let there be d candidate attributes called $A = [d] =
\{1, 2, \ldots, d\}$. We assume that, for any object o and
attribute a, there is a distribution over judgments
$P[X[a] \mid O = o]$, and we assume that the judgments of
attribute-object pairs are conditionally independent given the
sets of attributes and objects. This represents an idealized
setting in which a new random crowd worker is selected for each
attribute-object judgment (in our experiments, we limit the
total amount of work that any one worker may perform).
We assume a distribution $\mathcal{D}$ over labeled objects, where
labels are real numbers. We denote by $\mathcal{D}_O$ the marginal
distribution over objects drawn according to $\mathcal{D}$. We let
$P[X[a]] = \mathbb{E}_{O \sim \mathcal{D}_O}[P[X[a] \mid O]]$. Labels y are assumed to
be real valued. As is standard, we assume one "true" label
$y_i$ for each object $o_i$.

For notational ease, we assume that in the feature
multi-selection phase, exactly $k \ge 2$ judgments for each feature
are collected. Our analysis trivially generalizes to the
setting in which different attributes are judged different
numbers of times. Finally, each attribute a is assumed to have
an expected value of $\mathbb{E}[X[a]] = 0$, where the expectation
is taken across objects and judgments of a. This is done 
for ease of presentation, so that we do not have to track 
the mean vectors as well as the variance. When discussing 
implementation details, we describe how to remove this as- 
sumption in practice without loss of generality. 

Vectors will be boldface, e.g., $\mathbf{x} = (x[1], \ldots, x[d])$,
random variables will be capitalized, e.g., X, and matrices will
be in black-board font, e.g., $\mathbb{X}$. The i'th standard unit
vector is denoted by $\mathbf{e}_i$.

Let $\mathbf{r} \in \mathbb{N}^d$ represent the number of judgments for each
feature, so that attribute a is judged r[a] times, and we
represent the object's judgments by $\mathbf{x}$, defined as:

$$\mathbf{x} = \big((x[1](j))_{j=1}^{r[1]}, \ldots, (x[d](j))_{j=1}^{r[d]}\big),$$

where x[a](j) is the jth judgment of attribute a in $\mathbf{x}$, and
$(x[a](j))_{j=1}^{r[a]}$ is a vector with x[a](j) in coordinate j. We
say that $\mathbf{r}$ is the repeat vector of $\mathbf{x}$. We denote the set of all
possible representations with repeat vector $\mathbf{r}$ by $\mathbb{R}^{[\mathbf{r}]}$.

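We denote by $D_{\mathbf{r}}$ the distribution which draws
$(X, Y) \in \mathbb{R}^{[\mathbf{r}]} \times \mathbb{R}$ by first drawing a labeled object (O, Y)
from $\mathcal{D}$, and then drawing a random representation
$X \in \mathbb{R}^{[\mathbf{r}]}$ for this object. We denote by $D_\infty$ the distribution
that draws (X, Y) where $X \in \mathbb{R}^d$ by first drawing (O, Y) from
$\mathcal{D}$ and then setting $X[a] = \mathbb{E}[X[a] \mid O]$. We denote the
expectation over $D_{\mathbf{r}}$ by $\mathbb{E}_{\mathbf{r}} = \mathbb{E}_{(X,Y) \sim D_{\mathbf{r}}}$. For $D_\infty$ we
denote $\mathbb{E}_\infty = \mathbb{E}_{(X,Y) \sim D_\infty}$.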

For $k \ge 2$, let $\mathbf{k} = (k, k, \ldots, k) \in \mathbb{N}^d$ be the
repeat vector used in the first training phase. The feature
multi-selection algorithm receives as input a labeled training
set $S = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m))$ where $\mathbf{x}_i \in \mathbb{R}^{kd}$ and
$y_i \in \mathbb{R}$, drawn from $D_{\mathbf{k}}$. This sample is generated by first
drawing a set of labeled objects $((o_1, y_1), \ldots, (o_m, y_m))$
i.i.d. from $\mathcal{D}$, and then drawing a random representation
$\mathbf{x}_i$ for each object $o_i$. The algorithm further receives as input a
budget $B \in \mathbb{N}$, which specifies the total number of feature
judgments allowed for each unlabeled object at test (i.e.,
prediction) time. The output of the algorithm is a new
vector of repeats $\mathbf{r} \in R_B$, where

$$R_B \triangleq \Big\{\mathbf{r} \in \mathbb{N}^d \;\Big|\; \sum_{a \in A} r[a] \le B\Big\}.$$

Let o be an object with a true label y, and let $\hat{y}$ be a
prediction of the label of o. The squared loss for this prediction is
$\ell(y, \hat{y}) = (y - \hat{y})^2$. Given a function $f : Z \to \mathbb{R}$ for some
domain Z, and a distribution D over $Z \times \mathbb{R}$, we denote the
average loss of f on D by

$$\ell(f, D) = \mathbb{E}_{(Z,Y) \sim D}[\ell(f(Z), Y)].$$

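The final goal of our procedure is to find a predictor with a
low expected loss on labeled objects drawn from $\mathcal{D}$. This
predictor must use only B feature judgments for each
object, as determined by the test repeat vector $\mathbf{r}$. We consider
linear predictors $\mathbf{w} \in \mathbb{R}^d$ that operate on the vector of
average judgments $\bar{\mathbf{x}}$ of $\mathbf{x} \in \mathbb{R}^{[\mathbf{r}]}$, defined as follows:

$$\bar{x}[a] = \begin{cases} \frac{1}{r[a]} \sum_{j=1}^{r[a]} x[a](j) & \text{if } r[a] > 0, \\ 0 & \text{if } r[a] = 0. \end{cases}$$

For an input representation $\mathbf{x}$, the predictor $\mathbf{w}$ predicts the
label $\langle \mathbf{w}, \bar{\mathbf{x}} \rangle$. For a vector $\mathbf{v} \in \mathbb{R}^d$, we denote by
$\mathrm{Diag}(\mathbf{v}) \in \mathbb{R}^{d \times d}$ the diagonal matrix with v[a] in the ath
position. For a vector $\mathbf{r} \in \mathbb{N}^d$ and a matrix $\mathbb{S} \in \mathbb{R}^{d \times d}$, we
denote by $\mathrm{sub}_{\mathbf{r}}(\mathbb{S})$ the submatrix of $\mathbb{S}$ resulting from deleting
all rows and columns a such that r[a] = 0. For a vector $\mathbf{u}$,
$\mathrm{sub}_{\mathbf{r}}(\mathbf{u})$ omits entries a such that r[a] = 0. Here
$\mathrm{sub}_{\mathbf{r}}(\mathbf{u}) \in \mathbb{R}^{d'}$ and $\mathrm{sub}_{\mathbf{r}}(\mathbb{S}) \in \mathbb{R}^{d' \times d'}$, where d' is the
support size of $\mathbf{r}$.

We denote the pseudo-inverse of a matrix $\mathbb{A} \in \mathbb{R}^{n \times m}$ (see
e.g. Ben-Israel & Greville, 2003) by $\mathbb{A}^+$.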
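The definitions of this section can be sketched in a few lines of Python. The representation of an object as a list of per-attribute judgment lists, and the function names below, are assumptions made for this illustration rather than part of the paper.

```python
def average_judgments(x, r):
    """Vector of average judgments: mean of the r[a] judgments of each
    attribute a, and 0 for attributes with r[a] == 0."""
    averages = []
    for judgments, repeats in zip(x, r):
        if len(judgments) != repeats:
            raise ValueError("representation does not match repeat vector")
        averages.append(sum(judgments) / repeats if repeats > 0 else 0.0)
    return averages

def predict(w, x, r):
    """Linear prediction: inner product of w with the average judgments."""
    return sum(w_a * avg for w_a, avg in zip(w, average_judgments(x, r)))

def squared_loss(y, y_hat):
    """Squared loss of a prediction y_hat against the true label y."""
    return (y - y_hat) ** 2

def within_budget(r, budget):
    """True if the repeat vector uses at most `budget` judgments in total."""
    return all(count >= 0 for count in r) and sum(r) <= budget

# Hypothetical object with d = 3 attributes, the second one unused.
r = [2, 0, 3]
x = [[1.0, 0.0], [], [0.5, 0.5, 2.0]]
w = [2.0, 5.0, -1.0]
y_hat = predict(w, x, r)
loss = squared_loss(1.0, y_hat)
```

Note that the coefficient of an attribute with r[a] = 0 has no effect on the prediction, which is why the corresponding rows and columns can be deleted by the sub operator.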

3. Feature Multi-Selection Algorithms 

The input to a feature multi-selection algorithm is a budget
B and m labeled examples in which each attribute has been
judged k times, and the output is a repeat vector $\mathbf{r} \in R_B$.
Our ultimate goal is to find $\mathbf{r}$ and a predictor $\mathbf{w} \in \mathbb{R}^d$ such
that $\ell(\mathbf{w}, D_{\mathbf{r}})$ is minimal. We now give intuition about the
derivation of the algorithms; their formal definition is
given in Alg. 1.

Define the loss of a repeat vector to be $\ell(\mathbf{r}) =
\min_{\mathbf{w} \in \mathbb{R}^d} \ell(\mathbf{w}, D_{\mathbf{r}})$. The goal is to minimize $\ell(\mathbf{r})$ over
$\mathbf{r} \in R_B$. We give two forward-selection algorithms, both
of which begin with $\mathbf{r} = (0, \ldots, 0)$ and greedily increment
r[a] for the a that most decreases an estimate of $\ell(\mathbf{r})$. The key
question is how to estimate this projected loss $\ell(\mathbf{r})$,
since the number of judgments can exceed k. We simplify
notation by first considering only $\mathbf{r}$ which are positive, i.e.,
$r[a] \ge 1$ for each a. We will shortly explain how to handle
r[a] = 0. Define

$$\mathbf{b} = \mathbb{E}_{\mathbf{r}}[\bar{X} Y], \quad \text{and} \quad \Sigma_{\mathbf{r}} = \mathbb{E}_{\mathbf{r}}[\bar{X}^T \bar{X}].$$

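We call b[a] the correlation of a with the label. Note that
$\mathbf{b} = \mathbb{E}_{\mathbf{k}}[\bar{X} Y]$, since linearity of expectation implies that $\mathbf{b}$
does not depend on the repeat vector. Straightforward calculations show
that, for any positive repeat vector $\mathbf{r}$, if $\Sigma_{\mathbf{r}}$ is non-singular
(footnote 3),

$$\ell(\mathbf{r}) = \min_{\mathbf{w}} \mathbb{E}_{\mathbf{r}}[(\mathbf{w}^T \bar{X} - Y)^2] = \mathbb{E}[Y^2] - \mathbf{b}^T \Sigma_{\mathbf{r}}^{-1} \mathbf{b}.$$

Since $\mathbb{E}[Y^2]$ does not depend on $\mathbf{r}$, minimizing $\ell(\mathbf{r})$ is
equivalent to maximizing $\mathbf{b}^T \Sigma_{\mathbf{r}}^{-1} \mathbf{b}$ (for positive $\mathbf{r}$ and
non-singular $\Sigma_{\mathbf{r}}$).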
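The calculation behind this identity is short enough to spell out. The sketch below assumes that $\Sigma_{\mathbf{r}}$ is non-singular and, as a convention chosen only for this sketch, treats the vector of averaged judgments $\bar{X}$ as a column vector, so that $\Sigma_{\mathbf{r}} = \mathbb{E}_{\mathbf{r}}[\bar{X}\bar{X}^T]$.

```latex
\begin{align*}
\mathbb{E}_{\mathbf{r}}\big[(\mathbf{w}^T \bar{X} - Y)^2\big]
  &= \mathbf{w}^T \Sigma_{\mathbf{r}} \mathbf{w}
     - 2\,\mathbf{w}^T \mathbf{b} + \mathbb{E}[Y^2], \\
2\,\Sigma_{\mathbf{r}} \mathbf{w} - 2\,\mathbf{b} = 0
  &\;\Longrightarrow\; \mathbf{w}^\star = \Sigma_{\mathbf{r}}^{-1} \mathbf{b}, \\
\ell(\mathbf{r})
  &= \mathbf{b}^T \Sigma_{\mathbf{r}}^{-1} \mathbf{b}
     - 2\,\mathbf{b}^T \Sigma_{\mathbf{r}}^{-1} \mathbf{b} + \mathbb{E}[Y^2]
   = \mathbb{E}[Y^2] - \mathbf{b}^T \Sigma_{\mathbf{r}}^{-1} \mathbf{b}.
\end{align*}
```

The first line expands the square, the second sets the gradient with respect to $\mathbf{w}$ to zero (the objective is convex because $\Sigma_{\mathbf{r}}$ is positive semi-definite), and the third substitutes the minimizer back.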

3.1. A Scoring Algorithm 

The first algorithm that we propose is derived from the
zero-correlation assumption, that $\mathbb{E}[X[a]X[a']] = 0$ for
$a \ne a'$, or equivalently that the covariance matrix is
diagonal. Perhaps the simplest approach to standard feature
selection is to score each feature independently, based on
its normalized empirical correlation with the label, and to
select the B top-scoring features. If features are
uncorrelated and the training sample is sufficiently large, then
this efficient approach finds an optimal set of features. The
feature multi-selection scoring algorithm that we propose
is optimal under similar assumptions; however,
it is complicated by the fact that we may include multiple
repetitions of each feature. Under the zero-correlation
assumption, $\Sigma_{\mathbf{r}}$ is diagonal, and its ath element, for r[a] > 0,
can be expanded as

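$$\mathbb{E}_{\mathbf{r}}[(\bar{X}[a])^2] = \sigma^2[a] + \frac{v[a]}{r[a]}, \quad \text{where}$$

$$v[a] = \mathbb{E}_{(O,Y) \sim \mathcal{D}}[\mathrm{Var}[X[a] \mid O]] \quad \text{and} \quad \sigma^2[a] = \mathbb{E}_\infty[(X[a])^2].$$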

We refer to v[a] as the internal variance as it measures the
"inter-rater reliability" of a, and we call $\sigma^2[a]$ the external
variance as it is the inherent variance between examples.
Hence for a diagonal $\Sigma_{\mathbf{r}}$, simple manipulation gives,

$$\mathbb{E}[Y^2] - \ell(\mathbf{r}) = \sum_{a : r[a] > 0} \frac{(b[a])^2}{\sigma^2[a] + v[a]/r[a]}. \qquad (1)$$



Therefore, when $\Sigma_{\mathbf{r}}$ is diagonal, minimizing the projected
loss is equivalent to maximizing the RHS above, a sum of 
independent terms that depend on the correlation and on the 
internal and external variance of each attribute, all of which 
can be estimated just once, for all possible repeat vectors. 
As one expects, greater correlation indicates a better fea- 
ture, while a greater external variance indicates a worse fea- 
ture. A larger internal variance indicates that more repeats 
are needed to achieve prediction quality. 
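The right-hand side of Eq. (1) is easy to evaluate once the three per-attribute quantities are known. The following Python sketch, with hypothetical values for the correlations and variances, illustrates how extra repeats help a noisy attribute but not a perfectly reliable one. A term whose denominator is zero is skipped, which is an assumption of this sketch for degenerate features.

```python
def scoring_objective(r, b, sigma2, v):
    """Right-hand side of Eq. (1): sum over attributes with r[a] > 0 of
    b[a]^2 / (sigma2[a] + v[a] / r[a])."""
    total = 0.0
    for repeats, corr, ext_var, int_var in zip(r, b, sigma2, v):
        if repeats == 0:
            continue  # attribute not used
        denominator = ext_var + int_var / repeats
        if denominator > 0:  # a zero denominator contributes nothing
            total += corr ** 2 / denominator
    return total

# Two hypothetical attributes with equal correlation and external variance;
# the first has no internal variance (objective), the second is noisy.
b = [0.5, 0.5]
sigma2 = [1.0, 1.0]
v = [0.0, 3.0]

one_each = scoring_objective([1, 1], b, sigma2, v)
repeat_reliable = scoring_objective([2, 1], b, sigma2, v)
repeat_noisy = scoring_objective([1, 3], b, sigma2, v)
```

Here `repeat_reliable` equals `one_each`, since a second judgment of the reliable attribute adds nothing, while `repeat_noisy` is larger.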

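To estimate Eq. (1) we estimate each of the components on
the RHS. Unbiased estimation of $\mathbf{b}$ is straightforward, and
unbiased estimation of $\mathbf{v}$ is also possible for $k \ge 2$ samples
per object, though importantly one should use the unbiased
variance estimator,

$$\hat{v}[a] = \frac{1}{m} \sum_{i} \mathrm{VarEst}(x_i[a](1), \ldots, x_i[a](k)), \qquad \mathrm{VarEst}(\alpha_1, \ldots, \alpha_n) = \frac{1}{n-1} \sum_{j \in [n]} \Big(\alpha_j - \frac{1}{n} \sum_{j' \in [n]} \alpha_{j'}\Big)^2. \qquad (2)$$

3 For singular $\Sigma_{\mathbf{r}}$, the pseudo-inverse replaces $\Sigma_{\mathbf{r}}^{-1}$.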

Using these estimates of $\mathbf{v}$, we estimate the external
variance using the equality $\sigma^2[a] = \mathbb{E}_{\mathbf{k}}[(\bar{X}[a])^2] - v[a]/k$. A
slight complication arises here, as this estimate might be
negative for small samples, so we round it up to 0 when
this happens. Another issue might seem to arise when the 
denominator of one of the summands in Eq. (1) is zero, 
however note that this can only occur if both the internal 
and the external variance are zero, which implies that the 
feature is constantly zero, thus zeroing its correlation as 
well. The same holds for the estimated ratio. In such cases 
we treat the ratio as equal to 0. 
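The estimation steps described in this subsection can be sketched as follows. The nested-list data layout and the function names are assumptions of this illustration, and features are assumed to be zero-mean as in the text.

```python
def var_est(values):
    """Unbiased sample variance (divides by n - 1); requires n >= 2."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two judgments")
    mean = sum(values) / n
    return sum((value - mean) ** 2 for value in values) / (n - 1)

def estimate_components(judgments, labels):
    """judgments[i][a] is the list of k judgments of attribute a on object i.

    Returns per-attribute lists (b_hat, v_hat, sigma2_hat): the correlation
    with the label, the internal variance, and the external variance with
    negative estimates rounded up to 0.
    """
    m = len(judgments)
    d = len(judgments[0])
    k = len(judgments[0][0])
    b_hat, v_hat, sigma2_hat = [], [], []
    for a in range(d):
        means = [sum(judgments[i][a]) / k for i in range(m)]
        b_hat.append(sum(means[i] * labels[i] for i in range(m)) / m)
        internal = sum(var_est(judgments[i][a]) for i in range(m)) / m
        v_hat.append(internal)
        second_moment = sum(value * value for value in means) / m
        sigma2_hat.append(max(0.0, second_moment - internal / k))
    return b_hat, v_hat, sigma2_hat

# Hypothetical data: m = 2 objects, d = 1 attribute, k = 2 judgments each.
data = [[[1.0, 3.0]], [[-2.0, -2.0]]]
b_hat, v_hat, sigma2_hat = estimate_components(data, [1.0, -1.0])

# A case where the raw external-variance estimate would be negative.
noisy = [[[1.0, -1.0]], [[-1.0, 1.0]]]
_, _, clipped = estimate_components(noisy, [1.0, -1.0])
```

In the second data set the averages are all zero while the judgments disagree, so the raw estimate of the external variance is negative and is rounded up to 0.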

3.2. The Full Multi-Selection Algorithm 

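Algorithm 1 Feature multi-selection algorithms

Input: Budget B; training set $((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m))$; algorithm type: Scoring/Full.
Output: A repeat vector $\mathbf{r} \in R_B$.
$\bar{x}_i[a] \leftarrow \frac{1}{k} \sum_{j \in [k]} x_i[a](j)$ for $i \in [m]$, $a \in A$.
$\hat{\mathbf{b}} \leftarrow \frac{1}{m} \sum_i y_i \bar{\mathbf{x}}_i$.
$\hat{v}[a] \leftarrow \frac{1}{m} \sum_i \mathrm{VarEst}(x_i[a](1), \ldots, x_i[a](k))$ for $a \in A$.
if Scoring Algorithm then
    $\forall a \in A$, $\hat{\sigma}^2[a] \leftarrow \max\{0, \frac{1}{m} \sum_i (\bar{x}_i[a])^2 - \hat{v}[a]/k\}$.
    Define $\mathrm{obj}(\mathbf{r}) \equiv \sum_{a : r[a] > 0} (\hat{b}[a])^2 / (\hat{\sigma}^2[a] + \hat{v}[a]/r[a])$.
else
    $\hat{\Sigma} \leftarrow \mathrm{MakePSD}(\frac{1}{m} \sum_i \bar{\mathbf{x}}_i^T \bar{\mathbf{x}}_i - \mathrm{Diag}(\hat{\mathbf{v}})/k)$.
    $\mathbb{M}_{\mathbf{r}} \equiv \mathrm{sub}_{\mathbf{r}}(\hat{\Sigma} + \mathrm{Diag}(\hat{v}[1]/r[1], \ldots, \hat{v}[d]/r[d]))$.
    Define $\mathrm{obj}(\mathbf{r}) \equiv \mathrm{sub}_{\mathbf{r}}(\hat{\mathbf{b}})^T \mathbb{M}_{\mathbf{r}}^+ \mathrm{sub}_{\mathbf{r}}(\hat{\mathbf{b}})$.
end if
$\mathbf{r}_0 \leftarrow (0, \ldots, 0) \in \mathbb{N}^d$.
for t = 1 to B do
    Find $i_{\mathrm{best}} \in [d]$ such that $\mathrm{obj}(\mathbf{r}_{t-1} + \mathbf{e}_i)$ is maximal.
    $\mathbf{r}_t \leftarrow \mathbf{r}_{t-1} + \mathbf{e}_{i_{\mathrm{best}}}$.
end for
Return $\mathbf{r}_B$.

The Scoring algorithm is motivated by the assumption of
zero correlation between features. However, this assumption
rarely holds in practice. Building on and paralleling
the definitions and derivation above, the Full Algorithm
similarly maximizes $\mathbf{b}^T \Sigma_{\mathbf{r}}^{-1} \mathbf{b}$ without this assumption.
For positive $\mathbf{r}$, one has

$$\Sigma_{\mathbf{r}} = \Sigma + \mathrm{Diag}(v[1]/r[1], \ldots, v[d]/r[d]),$$

where $\Sigma = \mathbb{E}_\infty[X^T X]$ is the external covariance matrix,
and we estimate it based on the equality $\Sigma = \Sigma_{\mathbf{k}} -
\mathrm{Diag}(\mathbf{v})/k$. Just as the estimates of $\sigma^2[a]$ might be negative
in the Scoring algorithm, in the Full algorithm the estimate
of $\Sigma$ might not be positive semi-definite,
so we analogously "round up" our estimate of $\Sigma$ to the nearest
PSD matrix (see implementation details below). The
estimate when some of the r[a]'s are zero is formed by deleting
the corresponding entries in the estimate of $\mathbf{b}$ and the
corresponding rows and columns in the estimate of $\Sigma_{\mathbf{r}}$.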
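The decomposition of $\Sigma_{\mathbf{r}}$ used by the Full algorithm can be checked entry by entry. The sketch below relies only on the assumption, stated in Section 2, that judgments are conditionally independent given the object; the notation $\Sigma[a, a']$ for a matrix entry is introduced here for convenience.

```latex
\begin{align*}
a \neq a': \quad
\mathbb{E}_{\mathbf{r}}\big[\bar{X}[a]\,\bar{X}[a']\big]
  &= \mathbb{E}_{O}\big[\mathbb{E}[X[a] \mid O]\;\mathbb{E}[X[a'] \mid O]\big]
   = \Sigma[a, a'], \\
a = a': \quad
\mathbb{E}_{\mathbf{r}}\big[(\bar{X}[a])^2\big]
  &= \mathbb{E}_{O}\big[(\mathbb{E}[X[a] \mid O])^2\big]
   + \mathbb{E}_{O}\big[\mathrm{Var}[\bar{X}[a] \mid O]\big] \\
  &= \sigma^2[a] + \frac{v[a]}{r[a]}.
\end{align*}
```

The last step uses the fact that the mean of $r[a]$ independent judgments has variance $\mathrm{Var}[X[a] \mid O]/r[a]$. Off-diagonal entries therefore do not depend on $\mathbf{r}$, and setting $\mathbf{r} = \mathbf{k}$ and rearranging gives $\Sigma = \Sigma_{\mathbf{k}} - \mathrm{Diag}(\mathbf{v})/k$.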

3.3. Guarantees 

Under our distributional assumptions, we show that the
estimated objective functions used by our algorithms
converge to $\mathbb{E}[Y^2] - \ell(\mathbf{r})$. Thus maximizing the estimated
objective approximately minimizes $\ell(\mathbf{r})$. Formally, let
$\mathrm{obj}_f(\mathbf{r})$ and $\mathrm{obj}_s(\mathbf{r})$ be the objectives used in Alg. 1
for the Full algorithm and the Scoring algorithm,
respectively. Note that these objectives are implicitly functions
of the training sample S. For a symmetric matrix $\mathbb{S}$, let
$\lambda_{\min}(\mathbb{S})$ be the smallest eigenvalue of $\mathbb{S}$. We define:
$\lambda = \min_{\mathbf{r} \in R_B} \lambda_{\min}(\mathrm{sub}_{\mathbf{r}}(\Sigma))$, and $\tilde{B} = \min(B, d)$.



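Theorem 3.1. Suppose that all judgments and labels are
in $[-1, 1]$. Then for any $\delta \in (0, 1)$, with prob. at least $1 - \delta$
over m i.i.d. training samples from $D_{\mathbf{k}}$, for all $\mathbf{r} \in R_B$, for
$m \ge \Omega(\tilde{B} \ln(\tilde{B}d/\delta)/\lambda^2)$ we have

$$|\mathrm{obj}_f(\mathbf{r}) - (\mathbb{E}[Y^2] - \ell(\mathbf{r}))| \le O\left(\frac{\tilde{B}^3 \ln(\tilde{B}d/\delta)}{\lambda^2 \sqrt{m}}\right).$$

If the external covariance matrix $\Sigma$ is diagonal, then for
$m \ge \Omega(\ln(d/\delta)/\lambda^2)$ we have

$$|\mathrm{obj}_s(\mathbf{r}) - (\mathbb{E}[Y^2] - \ell(\mathbf{r}))| \le O\left(\frac{\ln(\tilde{B}d/\delta)}{\lambda^2 \sqrt{m}}\right).$$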

The proof of this theorem is provided in Appendix A. The
convergence rate for the Full algorithm stems from two
bounds: (1) If the norm of the minimizing $\mathbf{w}$ is at most
a, then the convergence rate is at most $\tilde{B}a^2/\sqrt{m}$; (2) With
high probability, the norm of the minimizing $\mathbf{w}$ is at most
$\sqrt{\tilde{B}}/\lambda$. An additional factor of $O(\tilde{B} \ln(\tilde{B}d))$ gets uniform
convergence over $\mathbf{r} \in R_B$. The components of this result
are of the same order as the equivalent results for uniform
convergence of standard least-squares regression. An
improved rate of $\sqrt{\tilde{B}a^2/m}$ can be achieved for least-squares
regression if the algorithm exactly minimizes the sample
squared loss (Srebro et al., 2010). However, our algorithm
minimizes another objective, thus this result is not directly
applicable. We leave it as a challenge for future work to
find out whether a faster rate can be achieved in our case.

As always, these convergence rates are worst-case, and 
in practice a much smaller sample size is often suffi- 
cient to get meaningful results, as we have observed in 
our experiments. However, if the available training sam- 
ple is too small to achieve reasonable results, one can 
limit the norm of the minimizer by adding regularization 
to the estimated covariance matrix, as in ridge regression 
(Hoerl & Kennard, 1970). This would allow faster conver- 
gence at the expense of a more limited class of predictors. 

As Theorem 3.1 shows, when the zero-correlation assump- 
tion holds, the Scoring algorithm enjoys a much faster 
worst-case rate of convergence than the full algorithm. This 
is because it does not attempt to estimate the entire covari- 
ance matrix. This advantage is more significant for larger 
budgets. An additional advantage is that it finds the optimal 
value of r for its estimated objective: 

Theorem 3.2. The Scoring algorithm returns
$\mathbf{r} \in \arg\max_{\mathbf{r} \in R_B} \mathrm{obj}_s(\mathbf{r})$.

Theorem 3.2 follows since f(r) = a/(b + c/r) is concave 
and increasing in r and due to the following observation. 

Lemma 3.3. Let $\mathbf{r} \in \mathbb{N}^d$, and let $f(\mathbf{r}) = \sum_{i \in [d]} g_i(r[i])$,
where $g_i(\cdot) : \mathbb{R}_+ \to \mathbb{R}$ are monotonic non-decreasing
concave functions. Let $B \in \mathbb{N}$. The maximum of $f(\mathbf{r})$ subject
to $\mathbf{r} \in R_B$ is attained by a greedy algorithm which starts
with $\mathbf{r} = (0, \ldots, 0)$, and iteratively increases the
coordinate which increases f the most.



The proof of this lemma is provided in Appendix A. 
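The greedy procedure of Lemma 3.3 can be sketched in Python and compared against exhaustive search on a small instance. The per-attribute gain below has the form of the summands in Eq. (1); the parameter values are hypothetical, and the brute-force check is included only for illustration.

```python
from itertools import product

def make_gain(b, sigma2, v):
    """Gain of judging one attribute `count` times: b^2 / (sigma2 + v/count),
    and 0 when the attribute is not judged at all."""
    def gain(count):
        if count == 0:
            return 0.0
        denominator = sigma2 + v / count
        return b * b / denominator if denominator > 0 else 0.0
    return gain

def greedy_allocate(gains, budget):
    """Start from the zero vector and, `budget` times, increment the
    coordinate whose increment increases the total gain the most."""
    r = [0] * len(gains)
    for _ in range(budget):
        best = max(range(len(gains)),
                   key=lambda i: gains[i](r[i] + 1) - gains[i](r[i]))
        r[best] += 1
    return r

def total_gain(gains, r):
    return sum(gain(count) for gain, count in zip(gains, r))

def brute_force_value(gains, budget):
    """Best total gain over all repeat vectors within the budget."""
    candidates = product(range(budget + 1), repeat=len(gains))
    return max(total_gain(gains, r) for r in candidates if sum(r) <= budget)

# Hypothetical (b, sigma2, v) triples for three attributes.
gains = [make_gain(0.5, 1.0, 0.0),
         make_gain(0.5, 1.0, 3.0),
         make_gain(0.3, 0.5, 1.0)]
allocation = greedy_allocate(gains, budget=5)
greedy_value = total_gain(gains, allocation)
optimal_value = brute_force_value(gains, budget=5)
```

On this instance the perfectly reliable first attribute receives a single judgment, and the remaining budget is split between the two noisy attributes.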

3.4. Implementation

If our estimate of $\Sigma$ is not PSD, we use the procedure
'MakePSD', which takes a symmetric matrix $\mathbb{A}$ as input,
and returns the PSD matrix closest to $\mathbb{A}$ in Frobenius norm. This
can be done by calculating the eigenvalue decomposition
$\mathbb{A} = \mathbb{U}\mathbb{D}\mathbb{U}^T$ where $\mathbb{U}$ is orthogonal and $\mathbb{D}$ is diagonal, and
returning $\mathbb{U}\tilde{\mathbb{D}}\mathbb{U}^T$, where $\tilde{\mathbb{D}}$ is $\mathbb{D}$ with zeroed negative
entries (Higham, 1988). If we assume a diagonal external
covariance, then this procedure is equivalent to rounding
negative estimates of $\sigma^2[a]$ up to zero, as done in the Scoring
algorithm. For a budget of B, the Full algorithm performs
Bd SVDs to calculate pseudo-inverses. Note, however, that
the largest matrix that might be decomposed here is of size
$\min(d, B) \times \min(d, B)$. Furthermore, in practice the
matrices can be much smaller, since the algorithm might choose
several repeats of the same features. In our experiments, 
the total time for decompositions, using standard libraries 
on a standard personal computer, has been negligible. 

Our description of the algorithms above assumes for sim- 
plicity that the mean of all features is zero. In practice, one 
adds a 'free' feature that is always 1, to allow for biased 
regressors. For the Scoring algorithm, one should further 
subtract the empirical mean from each feature. For the full 
algorithm, this is not necessary, because when bias is allowed,
adding a constant to any feature provably will not change 
the output of the full algorithm. 



4. Experiments 

We tested our approach on three regression problems. In 
the first problem the feature judgments were simulated. In 
the second and third problem they were collected from the 
crowd using Amazon's Mechanical Turk (footnote 4).

For the simulated experiment we used the UCI dataset 
'Relative location of CT slices on axial axis Data Set' 
(Frank & Asuncion, 2010). In this dataset the features are 
histograms of spatial measurements in the image, and the 
label to predict is the relative location of the image on the 
axial axis. To simulate features with varying judgments, 
we collapsed each set of 8 adjacent histogram bins into a 
single feature, so that each judgment of the new feature 
was randomly chosen out of 8 possible values for this fea- 
ture. The resulting dataset contained 48 noisy real-valued 
features per example. 
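The simulation of noisy judgments described above can be sketched as follows. The text says only that each judgment is "randomly chosen" among the 8 values, so uniform sampling is an assumption here, as are the function names and the placeholder histogram.

```python
import random

def collapse_bins(histogram, group_size=8):
    """Split the histogram into consecutive groups of `group_size` bins;
    each group becomes one noisy feature."""
    return [histogram[start:start + group_size]
            for start in range(0, len(histogram), group_size)]

def simulate_judgments(histogram, repeats, rng, group_size=8):
    """For each collapsed feature, draw repeats[a] simulated judgments,
    each a uniformly random bin value from that feature's group."""
    groups = collapse_bins(histogram, group_size)
    return [[rng.choice(group) for _ in range(count)]
            for group, count in zip(groups, repeats)]

# Placeholder example: 384 bins collapse into 48 features, each judged twice.
rng = random.Random(0)
histogram = [float(i) for i in range(384)]
groups = collapse_bins(histogram)
judgments = simulate_judgments(histogram, [2] * 48, rng)
```

With this construction, features whose 8 underlying bins differ widely behave like attributes with a large internal variance, while features with nearly identical bins behave like objective attributes.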

4 http://mturk.com. We will share our data upon request
from other researchers, due to the sensitivity of judgments on
people's images.

Figure 1. Properties of selected attributes for height prediction

Figure 2. Properties of selected attributes for weight prediction

The second and third problems were to predict the
height and weight of people from a photo. 880 photos
with self-declared height and weight were extracted from
the publicly available Photographic Height/Weight Chart
(Cockerham, 2013), where people post pictures of
themselves announcing their own height and weight. We chose
37 attributes that we felt the crowd could judge and might
be predictive. We collected judgments for these binary
attributes, mainly following the judgment collection
methodology of Patterson & Hays (2012), by batching the
images into groups of 40, making labeling very efficient. To
encourage honest workers, we promised (and delivered)
bonuses for good work. We further limited the amount of
work any one person could do. We used all of the
collected judgments, regardless of whether the workers
received bonuses for them or not. Our pay per hour was set
to average to minimum wage. We collected numerical
estimates of the height and the weight in a similar fashion.
Binary judgments took about one second per judgment and
their cost was a fraction of a cent per attribute judgment.
The numerical estimates took about four times as long and
we paid four times as much for them. Accordingly, we
adjusted all the algorithms to count a single numerical
judgment as equal to four binary attribute judgments.
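The cost adjustment for numerical judgments amounts to a weighted budget constraint. A minimal sketch, in which the constant and function names are hypothetical:

```python
BINARY_COST = 1      # cost of one binary attribute judgment
NUMERICAL_COST = 4   # one numerical judgment counts as four binary ones

def judgment_cost(r, is_numerical):
    """Total budget consumed by repeat vector r, where is_numerical[a]
    says whether attribute a is a numerical estimate."""
    return sum(count * (NUMERICAL_COST if numerical else BINARY_COST)
               for count, numerical in zip(r, is_numerical))

# Hypothetical repeat vector: one numerical attribute and two binary ones.
cost = judgment_cost([2, 3, 1], [True, False, False])
```

A greedy selection step that respects this accounting would compare the improvement in the objective against the cost of the increment, rather than treating all increments as equal; how exactly the authors adapted the greedy step is not specified in the text.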

Figure 1 and Figure 2 show the normalized correlation
($b[a]/\sigma[a]$) vs. the normalized inter-rater reliability
($v[a]/\sigma[a]$) of selected attributes. These plots demonstrate
that all combinations of useful/non-useful and stable/noisy 
attributes exist in this data. The full data listing all the at- 
tributes and their properties is provided in Table 1. 

Table 1 lists all the attributes that were collected for the
height and weight prediction problem, their internal vari- 
ance and their normalized correlation for each of the pre- 
diction tasks. 

We compared the test error of our algorithms, denoted 
'Full' and 'Scoring' in the plots, to those of several plau- 
sible baselines. In all comparisons, we set k = 2. The 
first baseline, denoted 'Averages' in the plots, is based on 
the "predictive" feature selection algorithm of Isola et al. 
(2011): We first average the 2 judgments per attribute to 
create a standard data set with one value for each object- 
attribute pair, and then greedily add attributes, one at a 
time, so as to minimize the least-squares error. The re- 
sulting regressor uses 2 judgments for each selected fea- 
ture. The second baseline, denoted 'Copies', treats the 2 
judgments of each feature-object pair as 2 different indi- 
vidual attributes, and again performs greedy forward selec- 
tion on these features. Here the test repeat vector r was 
set according to the number of copies selected for each fea- 
ture. Note that these baselines perform standard Machine 
Learning feature selection: 'Averages' considers d features
and 'Copies' considers 2d features. For height and weight 
prediction, we compared the results also to the test error 
achieved by averaging only the height or weight estimates 
of the crowd, respectively. Since each numerical feature 
costs 4 times as much as a binary feature, we averaged 
over B/4 numerical judgments when the budget was set
to B. We did not use regularization anywhere, thus our 
algorithms and the baselines are all parameter-free. 

The test error presented in the plots was obtained as fol- 
lows: r was selected based on a training set with k judg- 
ments. We then added judgments to features in the training 
set to get to r repeats. Finally we performed regular re- 
gression on the means of the enhanced training set to get a 
predictor. This predictor was then used to predict the labels 
of the test set with r judgments. In all the comparisons, 
each experiment was averaged over 50 random train/test 
splits. In all of the experiments, shown in Figures 3-7, our 
full algorithm achieved better test error than the baselines. 
The Scoring algorithm was usually somewhat worse than 
the Full Multi-Selection algorithm, and for small budgets 
also sometimes worse than the baselines. This is expected 
due to its zero-correlation assumption. However, when the 
sample size was small, the Scoring algorithm was sometimes 
better (see e.g., Figure 6), since it suffered from less 
over-fitting. This is consistent with our convergence 
analysis in Theorem 3.1. Analysis of training errors indicates 
that baseline algorithms suffer for two different reasons: 
(1) they are limited to a small number of repeats per 
feature; and (2) they suffer from greater over-fitting. The 
second reason is probably due to the fact that our algorithm 
tends to select a sparser r than do the baselines. Table 2 
shows examples of predictors, with number of judgments 
for each attribute, learned by our full algorithm. 

Attribute       var     height   weight 
beautiful       1.80             -0.07 
tall            1.81    0.12 
dark-skin                        0.00 
fair-skin       2.38             -0.02 
short                   0.05     0.08 
good-posture    3.26    0.02     -0.05 

Table 1. All attributes used for height and weight prediction. 
'var' indicates the normalized inter-rater variability ($v[a]/\sigma[a]$). 
'height' and 'weight' indicate the estimated quality of each of the 
attributes for the respective prediction task ($b[a]/\sigma[a]$). 
In our last experiment we tested the tradeoff between the 
number of training judgments per feature and the number 
of training examples, in the following setting: Suppose 
we have a budget that allows us to collect a total of 
M judgments for training the feature multi-selection 
algorithm, and we have access to at least M/2d labeled 
examples. We can decide on a number k of judgments per 
feature, randomly select M/kd objects from our labeled pool 
to serve as the training set, and obtain kd judgments for 
each of these objects. What number k should we choose? 
Does this number depend on the total budget M? We 
compared the test error arising from different values of k over 
different values of M, for the 'Slice' dataset using both of 
our algorithms. The results are shown in Figure 8. These 
results show a clear preference for a small k (which allows 
a large m on the same budget M). Characterizing the 
optimal number k is left as an open question for future work. 

5. Conclusions 

We introduce the problem of feature multi-selection and 
provide two algorithms for the case of regression with 
mean averaging of judgments. Future directions of re- 
search include other learning tasks, such as classification, 
and other types of feature aggregation, such as median av- 
eraging (which, for binary features, is equivalent to taking 
the majority). An additional important question for future 
work is how to carry out feature multi-selection in an 
environment with a changing crowd. 

Height 

Attribute                 reps   coefficient 
dark hair                 1      -0.39 
dark skin                 1      -0.52 
female                    1      -0.94 
heavy                            0.61 
long hair                        -0.59 
looking at the camera            -0.45 
male                             4.88 
outdoors                         0.55 
tall                      17     7.38 
wearing baggy clothes     6      -2.49 
(bias)                    N/A    58.82 

Weight (pounds) 

Attribute                 reps   coefficient 
attractive                2      10.08 
can see a lot of skin     1      -6.44 
female                    3      -36.15 
has beard                        14.47 
heavy                     17 
long hair 
male                             10.09 
over 30 years old         14 
over 40 years old                -34.37 
thin                             -12.80 
(bias)                    N/A    -87.56 

Table 2. Examples of predictors generated by the full algorithm 
for prediction of height and weight. For each attribute and image, 
the selected number of judgment repeats is collected, and the 
coefficient of the attribute is multiplied by the fraction of judgments 
that designated this attribute as true for the image. 

[Figure 8; x-axis: #judgments per feature during training] 

Figure 8. 'Slice' dataset: Training loss with different numbers 
of judgments per feature during training, when #training 
examples x #repeats is kept constant. Numbers in legend indicate 
value of constant. 
A. Analysis 
A.1. Notations 

For a symmetric matrix $\mathbb{S}$, let $\lambda_{\max}(\mathbb{S})$ and 
$\lambda_{\min}(\mathbb{S})$ be the largest and the smallest eigenvalues 
of $\mathbb{S}$, respectively. For functions $\alpha$ and $\beta$, we say 
that $\alpha \le O(\beta)$ if for some constants $C, C' > 0$, 
$\alpha \le C\beta + C'$. Similarly, $\alpha \ge \Omega(\beta)$ 
indicates that for some constants $C, C' > 0$, 
$\alpha \ge C\beta - C'$. We say that $\alpha \le \tilde{O}(\beta)$ 
if for some constants $C, C' > 0$, 
$\alpha \le C\beta\ln(\beta) + C'$. Similarly, 
$\alpha \ge \tilde{\Omega}(\beta)$ indicates that for some constants 
$C, C' > 0$, $\alpha \ge C\beta\ln(\beta) - C'$. 

Denote by $Z_r$ a diagonal $d \times d$ matrix whose $i$'th diagonal 
entry is $I[r[i] > 0]$. Recall that $v[i]$ is the internal variance 
of feature $i$. For a vector $r \in \mathbb{N}^d$, let $V_r$ be the 
diagonal matrix such that its $i$'th diagonal entry is zero if 
$r[i] = 0$ and equal to $v[i]/r[i]$ otherwise. Let $\hat{V}_r$ be 
defined similarly but using the sample estimate $\hat{v}[i]$, defined 
in Eq. (2), instead of $v[i]$. Denote by $n_r$ the number of 
non-zero entries in $r \in \mathbb{N}^d$. 

Let $x$ be an example with a repeat vector $k$ such that 
$k[i] \ge 2$.$^5$ Let $\hat{v}(x)[i] = \mathrm{VarEst}(x[i](1), \ldots, x[i](k[i]))$. 
Given a repeat vector $r$, define the $r$-loss $\ell_r$ of the labeled 
example $(x, y)$ by: 

$$\ell_r(w, x, y) = (\langle Z_r w, x \rangle - y)^2 + \sum_{i : r[i] > 0} w[i]^2 \, \hat{v}(x)[i] \left( \frac{1}{r[i]} - \frac{1}{k[i]} \right).$$ 

Let $S = ((x_1, y_1), \ldots, (x_m, y_m))$ be a training set of 
labeled representations drawn i.i.d. from $\mathcal{D}_k$. Denote the 
vector of training labels by $\mathbf{y} = (y_1, \ldots, y_m)$. We denote 
the average of $\ell_r$ over $S$ by 

$$\ell_r(w, S) = \frac{1}{m} \sum_{i \in [m]} \ell_r(w, x_i, y_i).$$ 

Define 

$$\hat{\Sigma} = \frac{1}{m} \sum_{i \in [m]} x_i x_i^T - \hat{V}_k,$$ 
$$\hat{\Sigma}_r = Z_r (\hat{\Sigma} + \hat{V}_r) Z_r,$$ 
$$\hat{b} = \frac{1}{m} \sum_{i \in [m]} y_i x_i.$$ 

Note that the notation for $\hat{\Sigma}$ here is different than the one 
used in Alg. 1, since $\hat{\Sigma}$ is not 'corrected' to be PSD. We 
denote the corrected estimate used in the algorithm by $\hat{\Sigma}^P$. 
Similarly, we denote by $\hat{\Sigma}^P_r$ the estimate for $\Sigma_r$ 
resulting from using $\hat{\Sigma}^P$ instead of $\hat{\Sigma}$. We have 

$$\ell_r(w, S) = w^T \hat{\Sigma}_r w - 2 w^T Z_r \hat{b} + \frac{1}{m} \mathbf{y}^T \mathbf{y}.$$ 

Define $\lambda_r = \lambda_{\min}(\mathrm{sub}_r(\Sigma))$. Note that for 
$\lambda$ defined in the statement of Theorem 3.1, 
$\lambda = \min_{r \in R_B} \lambda_r$. 

$^5$ Alg. 1 can be applied to $k$ with different values per feature 
by simply using $k[a]$ instead of $k$ in step 7 and $\hat{V}_k$ instead of 
$\mathrm{Diag}(\hat{v})/k$ in step 10. Our analysis holds for this more 
general algorithm. 
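
The agreement between the averaged per-example r-loss and its matrix form can be checked numerically. The sketch below is only an illustration under stated assumptions: the r-loss is taken to be the sum of the squared-error component and the variance-correction component used in the proof of Lemma A.3, x holds the per-feature means of the k judgments, VarEst is taken to be the unbiased sample variance, and the estimate \hat{v}[i] is taken to be the average of the per-example variance estimates (Eq. (2) is not reproduced in this section). All data is random toy data.

```python
import random
from statistics import mean, variance

random.seed(0)
m, d = 5, 3
k = [2, 3, 4]            # judgments collected per feature (all >= 2)
r = [1, 0, 6]            # candidate repeat vector; feature 1 is dropped
w = [0.5, -1.0, 0.25]

# J[n][i] holds the k[i] raw judgments of feature i on example n.
J = [[[random.uniform(-1, 1) for _ in range(k[i])] for i in range(d)]
     for _ in range(m)]
y = [random.uniform(-1, 1) for _ in range(m)]
x = [[mean(J[n][i]) for i in range(d)] for n in range(m)]      # averaged features
v = [[variance(J[n][i]) for i in range(d)] for n in range(m)]  # assumed VarEst
z = [1.0 if r[i] > 0 else 0.0 for i in range(d)]               # diagonal of Z_r

def r_loss(n):
    """Per-example r-loss: squared error plus variance correction."""
    pred = sum(z[i] * w[i] * x[n][i] for i in range(d))
    corr = sum(w[i] ** 2 * v[n][i] * (1 / r[i] - 1 / k[i])
               for i in range(d) if r[i] > 0)
    return (pred - y[n]) ** 2 + corr

per_example = mean([r_loss(n) for n in range(m)])

# Matrix form: w^T Sigma_r w - 2 w^T Z_r b + (1/m) y^T y.
v_hat = [mean([v[n][i] for n in range(m)]) for i in range(d)]  # assumed Eq. (2)

def sigma_r(i, j):
    s = mean([x[n][i] * x[n][j] for n in range(m)])
    if i == j:
        s += -v_hat[i] / k[i] + (v_hat[i] / r[i] if r[i] > 0 else 0.0)
    return z[i] * s * z[j]

b_hat = [mean([y[n] * x[n][i] for n in range(m)]) for i in range(d)]
quad = sum(w[i] * sigma_r(i, j) * w[j] for i in range(d) for j in range(d))
lin = sum(w[i] * z[i] * b_hat[i] for i in range(d))
matrix_form = quad - 2 * lin + mean([t * t for t in y])
```

Under these assumptions `per_example` and `matrix_form` agree up to floating-point error, for any choice of `w`, `r` and `k`.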
A.2. Proof of Theorem 3.1 

To prove Theorem 3.1 we require several lemmas. All of 
the analysis below is under the assumption of Theorem 3.1, 
that all judgments and labels are in $[-1, 1]$ with probability 
1. 

First, the following lemma links the covariance matrix of 
the population when averaging r judgments to the covari- 
ance matrix of the population when using the expected val- 
ues of the features for each object. 

Lemma A.1. Let $r \in \mathbb{N}^d$. Then $\Sigma_r = Z_r \Sigma Z_r + V_r$. 

Proof. Consider a random object $O$ drawn from $\mathcal{D}_O$, and 
define the vector $\mu_O$ such that for each coordinate $i$, 
$\mu_O[i] = E[X[i] \mid O]$. Define $\Sigma_{|O} = \mu_O \mu_O^T$, so that 
$\Sigma = E_{O \sim \mathcal{D}_O}[\Sigma_{|O}]$. Similarly, define 
$\Sigma_{r|O} = E_r[X X^T \mid O]$, so that 
$\Sigma_r = E_{O \sim \mathcal{D}_O}[\Sigma_{r|O}]$. Lastly, denote the 
variance of a feature on object $O$ by 
$v_O[i] = E[(X[i] - E[X[i] \mid O])^2 \mid O]$, and let $V_{r|O}$ be the 
diagonal matrix whose $i$'th entry is zero if $r[i] = 0$, and equal 
to $v_O[i]/r[i]$ otherwise. Again $V_r = E_O[V_{r|O}]$. We will 
prove that for any object $O$, 

$$\Sigma_{r|O} = Z_r \Sigma_{|O} Z_r + V_{r|O}. \qquad (3)$$ 

The desired equality will follow by averaging over 
$O \sim \mathcal{D}_O$. 

Denote the entries of $\Sigma_{r|O}$ by $s_{ik}$. First, whenever 
$r[i] = 0$ or $r[k] = 0$, entry $(i, k)$ is zero for both sides of 
Eq. (3). Now, consider a non-diagonal entry $s_{ik}$ of $\Sigma_{r|O}$ 
for $i \ne k$, $r[i] > 0$ and $r[k] > 0$. The $(i, k)$ entry on the 
right-hand side of Eq. (3) is $\mu_O[i]\mu_O[k]$. For the left-hand 
side we have 

$$s_{ik} = E_r[X[i] X[k] \mid O] = E_r[X[i] \mid O] \cdot E_r[X[k] \mid O] = \mu_O[i]\mu_O[k].$$ 

Thus the equality holds for non-diagonal entries. 

Now, consider a diagonal entry $s_{ii}$ of $\Sigma_{r|O}$. If $r[i] = 0$ 
then both sides of Eq. (3) are zero. Assume $r[i] > 0$. We have 

$$s_{ii} = E_r[X[i]^2 \mid O] = E_r\Big[\Big(\frac{1}{r[i]} \sum_{j \in [r[i]]} X[i](j)\Big)^2 \,\Big|\, O\Big]$$ 
$$= \frac{1}{r[i]^2} \Big( \sum_{j \in [r[i]]} E_r[X[i](j)^2 \mid O] + 2 \sum_{j < j',\; j, j' \in [r[i]]} E_r[X[i](j) \cdot X[i](j') \mid O] \Big).$$ 

Conditioned on $O$, $X[i](j)$ and $X[i](j')$ are statistically 
independent. Therefore 

$$E_r[X[i](j) \cdot X[i](j') \mid O] = E_r[X[i](j) \mid O] \cdot E[X[i](j') \mid O] = \mu_O[i]^2.$$ 

In addition, $E_r[(X[i](j))^2 \mid O] = v_O[i] + \mu_O[i]^2$. Combining 
the two equalities we get 

$$s_{ii} = \frac{1}{r[i]} (v_O[i] + \mu_O[i]^2) + \frac{1}{r[i]^2} \cdot (r[i]^2 - r[i]) \, \mu_O[i]^2 = \mu_O[i]^2 + v_O[i]/r[i].$$ 

This is exactly the value of entry $(i, i)$ on the right-hand 
side of Eq. (3). We conclude that Eq. (3) holds for all types 
of entries, thus the lemma is proved. □ 
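
The diagonal case can be checked exactly for a binary feature. The sketch below assumes, for illustration only, that the judgments of one feature on a fixed object are i.i.d. Bernoulli(p), so that the conditional mean is p and the conditional variance is p(1 - p); the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def second_moment_of_mean(p, r):
    """Exact E[(mean of r i.i.d. Bernoulli(p) judgments)^2]."""
    return sum(
        comb(r, s) * p**s * (1 - p) ** (r - s) * Fraction(s, r) ** 2
        for s in range(r + 1)
    )

p = Fraction(3, 10)   # plays the role of mu_O[i]
v = p * (1 - p)       # plays the role of v_O[i]
checks = [second_moment_of_mean(p, r) == p**2 + v / r for r in (1, 2, 5, 17)]
```

Every entry of `checks` is true, matching the diagonal value $\mu_O[i]^2 + v_O[i]/r[i]$ derived above.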

Using Lemma A.1, we can show that $\ell_r(w, S)$ is an unbiased 
estimator of $\ell(w, \mathcal{D}_r)$. 

Lemma A.2. Let $k, r \in \mathbb{N}^d$, such that $\forall i \in [d]$, $k[i] \ge 2$. 
Let $S$ be a sample drawn i.i.d. from $\mathcal{D}_k$. For any $w \in \mathbb{R}^d$, 
$\ell(w, \mathcal{D}_r) = E_S[\ell_r(w, S)]$. 

Proof. We have 

$$E_S[\ell_r(w, S)] = E_S\Big[w^T \hat{\Sigma}_r w - 2 w^T Z_r \hat{b} + \frac{1}{m} \mathbf{y}^T \mathbf{y}\Big]$$ 
$$= w^T E_S[\hat{\Sigma}_r] w - 2 w^T Z_r E_S[\hat{b}] + E_S\Big[\frac{1}{m} \mathbf{y}^T \mathbf{y}\Big].$$ 

From the definition of $\hat{\Sigma}_r$, we have 

$$E_S[\hat{\Sigma}_r] = Z_r E_S\Big[\frac{1}{m} \sum_{i \in [m]} x_i x_i^T + \hat{V}_r - \hat{V}_k\Big] Z_r = Z_r (\Sigma_k + V_r - V_k) Z_r = \Sigma_r,$$ 

where the last equality follows from Lemma A.1. 

In addition, $E[\hat{b}] = b$, and $E[\frac{1}{m} \mathbf{y}^T \mathbf{y}] = E[Y^2]$. Therefore 

$$E_S[\ell_r(w, S)] = w^T \Sigma_r w - 2 w^T Z_r b + E[Y^2].$$ 

On the other hand, we have 

$$\ell(w, \mathcal{D}_r) = E_r[(\langle w, X \rangle - Y)^2]$$ 
$$= E_r[(w^T X - Y)(X^T w - Y)]$$ 
$$= w^T E_r[X X^T] w - 2 w^T E_r[X Y] + E[Y^2]$$ 
$$= w^T \Sigma_r w - 2 w^T Z_r b + E[Y^2] = E_S[\ell_r(w, S)].$$ 

This completes the proof. □ 

The next step is to bound the rate of convergence of 
$\min_{w \in \mathbb{R}^d} \ell_r(w, S)$ to 
$\ell(r) = \min_{w \in \mathbb{R}^d} \ell(w, \mathcal{D}_r)$. We 
show that as $m$ grows, the difference between the two 
quantities approaches zero. The following lemma provides 
guarantees under the assumption that the two minimizers 
have a bounded norm. We will then go on to show that such 
a bound on the norm holds with high probability, where the 
bound depends on the external covariance matrix $\Sigma$. 



Feature Multi-Selection among Subjective Features 



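Lemma A.3. Let $a > 0$, and let $W_a = \{w \in \mathbb{R}^d \mid \|w\| \le a\}$. 
Let $\delta \in (0, 1)$. Fix some $w^* \in W_a$, and let 
$\hat{w} \in \mathrm{argmin}_{w \in \mathbb{R}^d} \ell_r(w, S)$ such that 
$\|\hat{w}\|$ is minimal. With probability at least $1 - \delta$ over 
the draw of $S$, if $\|\hat{w}\| \le a$, then 

$$|\ell(w^*, \mathcal{D}_r) - \ell_r(\hat{w}, S)| \le O\Big(a^2 n_r \sqrt{\ln(e/\delta)/m}\Big).$$ 

Proof. By Lemma A.2, $\ell(w, \mathcal{D}_r) = E_k[\ell_r(w, X, Y)]$. By 
Rademacher complexity bounds (Bartlett & Mendelson, 
2002), with probability $1 - \delta$, for all $w \in W_a$, 

$$E_k[\ell_r(w, X, Y)] \le \ell_r(w, S) + \mathcal{R}_m(\ell_r \circ W_a, \mathcal{D}_k) + O\Big(\beta \sqrt{\ln(1/\delta)/m}\Big), \qquad (4)$$ 

where $\mathcal{R}_m(\ell_r \circ W_a, \mathcal{D}_k)$ is the expected Rademacher 
complexity of the function class $W_a$ under $\ell_r$, and $\beta$ is the 
maximal value of $\ell_r$ on the possible inputs. Under our 
assumptions, $\beta \le a^2 n_r$. 

We wish to bound the Rademacher complexity for our 
function class, defined as 

$$\mathcal{R}_m(\ell_r \circ W_a, \mathcal{D}_k) = \frac{1}{m} E_S\Big[E_\sigma\Big[\Big|\sup_{w \in W_a} \sum_{i \in [m]} \sigma_i \ell_r(w, x_i, y_i)\Big|\Big]\Big],$$ 

where $\sigma = (\sigma_1, \ldots, \sigma_m)$ are $m$ independent uniform 
$\{\pm 1\}$-valued variables, and $S = ((x_1, y_1), \ldots, (x_m, y_m))$ 
is a random sample drawn i.i.d. from $\mathcal{D}_k$. Denote the 
components of $\ell_r$ by 

$$\ell^a_r(w, x, y) = (\langle Z_r w, x \rangle - y)^2,$$ 
$$\ell^b_r(w, x, y) = \sum_{i : r[i] > 0} w[i]^2 \, \hat{v}(x)[i] \Big(\frac{1}{r[i]} - \frac{1}{k[i]}\Big).$$ 

We have 

$$\mathcal{R}_m(\ell_r \circ W_a, \mathcal{D}_k) \le \mathcal{R}_m(\ell^a_r \circ W_a, \mathcal{D}_k) + \mathcal{R}_m(\ell^b_r \circ W_a, \mathcal{D}_k).$$ 

We bound each of these Rademacher complexities individually. 
The first term is the Rademacher complexity of the 
squared loss over the distribution generated by averaging 
over judgment vectors drawn from $\mathcal{D}_k$. Standard 
application of the Lipschitz properties of the squared loss 
following Bartlett & Mendelson (2002) provides the following 
bound: 

$$\mathcal{R}_m(\ell^a_r \circ W_a, \mathcal{D}_k) \le O(\ldots).$$ 

For the second term, denote $u_w = Z_r \cdot (w[1]^2, \ldots, w[d]^2)^T$, 
and $v_x = Z_r \cdot (\hat{v}(x)_1, \ldots, \hat{v}(x)_d)^T$. Then 

$$\ell^b_r(w, x, y) = \sum_{i : r[i] > 0} \Big(\frac{1}{r[i]} - \frac{1}{k[i]}\Big) u_w[i] \, v_x[i].$$ 

Since all feature judgments are in $[-1, 1]$, $v_x[i] \le 4$, thus 
$\|v_x\| \le 4\sqrt{n_r}$. In addition, $\|u_w\| \le \|w\|^2$. Lastly, 
$|\frac{1}{r[i]} - \frac{1}{k[i]}| \le 1$. Thus, by standard Rademacher 
complexity bounds for the linear loss, 

$$\mathcal{R}_m(\ell^b_r \circ W_a, \mathcal{D}_k) \le O\Big(\frac{1}{\sqrt{m}} \sup_{w \in W_a} \|u_w\| \cdot \sup_x \|v_x\|\Big) \le O\Big(a^2 \sqrt{n_r/m}\Big).$$ 

Summing the two terms, we have shown that 

$$\mathcal{R}_m(\ell_r \circ W_a, \mathcal{D}_k) \le O(\ldots).$$ 

Combining this with Eq. (4) and taking the minimum on 
both sides, we get 

$$\ell(w^*, \mathcal{D}_r) \le \ell_r(\hat{w}, S) + O\Big(a^2 n_r \sqrt{\ln(e/\delta)/m}\Big).$$ 

This completes one side of the bound. To bound 
$\ell_r(\hat{w}, S) - \ell(w^*, \mathcal{D}_r)$, we note that, by the minimality 
of $\hat{w}$ and by Hoeffding's inequality, with probability $1 - \delta$ 

$$\ell_r(\hat{w}, S) \le \ell_r(w^*, S) \le E[\ell_r(w^*, S)] + O\Big(\beta \sqrt{\ln(e/\delta)/m}\Big).$$ 

Combining the two bounds and applying the union bound 
we get the statement of the lemma. □ 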

We now turn to bound the norm of any $w$ considered by 
our algorithm. Let 

$$\hat{\Sigma} = \frac{1}{m} \sum_{i \in [m]} x_i x_i^T - \hat{V}_k.$$ 

First, we relate the smallest eigenvalue of $\hat{\Sigma}$ to that of the 
true $\Sigma$. 

Lemma A.4. Let $\delta \in (0, 1)$. With probability at least $1 - \delta$, 

$$\lambda_{\min}(\mathrm{sub}_r(\hat{\Sigma})) \ge \lambda_r - \sqrt{O(\ln(1/\delta) + n_r \ln(n_r m))/m}. \qquad (5)$$ 

Proof. It can be shown (see e.g. Sabato, 2012), by using 
an $\epsilon$-net of vectors on the unit sphere, that for a symmetric 
random matrix $\mathbb{S} \in \mathbb{R}^{n \times n}$, and positive 
$\beta, \epsilon, \gamma$, 

$$P[\lambda_{\min}(\mathbb{S}) \le \beta - \gamma\epsilon] \le P[\lambda_{\max}(\mathbb{S}) > \gamma] + O((n/\epsilon)^n) \min_{u : \|u\| = 1} P[u^T \mathbb{S} u \le \beta]. \qquad (6)$$ 

In our case, we have $\mathbb{S} = \mathrm{sub}_r(\hat{\Sigma})$ and $n = n_r$. Due to 
the boundedness of the judgments, all the entries of $\mathbb{S}$ are 
at most 1. Therefore $\lambda_{\max}(\mathbb{S}) \le n_r$. We thus let 
$\gamma = n_r$, so that $P[\lambda_{\max}(\mathbb{S}) > \gamma] = 0$. To bound 
$P[u^T \mathbb{S} u \le \beta]$ for $u$ such that $\|u\| = 1$, note that 

$$u^T \mathbb{S} u = \frac{1}{m} \sum_{l \in [m]} \Big( \langle u, Z_r x_l \rangle^2 - \sum_{i : r[i] > 0} u[i]^2 \, \hat{v}(x_l)[i]/k[i] \Big).$$ 

By McDiarmid's inequality (McDiarmid, 1989), for any 
$t > 0$, 

$$P[u^T \mathbb{S} u \le E[u^T \mathbb{S} u] - t] \le \exp(-2t^2/(m\Delta^2)),$$ 

where $\Delta$ is the maximal difference between $u^T \mathbb{S} u$ with 
some $x_1, \ldots, x_m$ and $u^T \mathbb{S} u$ with $x_i$ replaced by some 
$x'_i$. It is easy to verify that due to the boundedness of all 
$x_i$, $\Delta \le O(n_r/m)$. Further, 
$u^T E[\mathbb{S}] u = u^T \mathrm{sub}_r(\Sigma) u \ge \lambda_{\min}(\mathrm{sub}_r(\Sigma)) = \lambda_r$. 
Thus, 

$$P[u^T \mathbb{S} u \le \lambda_r - t] \le \exp(-2mt^2/n_r^2).$$ 

Substituting into Eq. (6), we get 

$$P[\lambda_{\min}(\mathbb{S}) \le \lambda_r - t - \epsilon n_r] \le O((n_r/\epsilon)^{n_r}) \exp(-2mt^2/n_r^2) = \exp\big(-2mt^2/n_r^2 + O(n_r \ln(n_r/\epsilon))\big).$$ 

Letting $\epsilon = t/n_r$ and solving for $\delta$ we get the statement of 
the lemma. □ 

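Using Lemma A.4, we can now bound the norms of a 
minimizer of $\ell(w, \mathcal{D}_r)$ and of a minimizer of $\ell_r(w, S)$ with 
high probability. 

Lemma A.5. Let $\hat{w} \in \mathbb{R}^d$ be a minimum-norm minimizer 
for $\ell_r(w, S)$, that is, let $M$ be the set of minimizers 
for $\ell_r(w, S)$, and let $\hat{w} \in \mathrm{argmin}_{w \in M} \|w\|$. Let 
$w^*$ be a minimum-norm minimizer for $\ell(w, \mathcal{D}_r)$. If 
$m \ge \tilde{\Omega}((\ln(1/\delta) + n_r \ln(n_r))/\lambda^2)$, then with 
probability at least $1 - \delta$, $\mathrm{sub}_r(\hat{\Sigma})$ is positive 
definite, and 

$$\|\hat{w}\| \le \sqrt{n_r} \Big( \lambda_r - \sqrt{O(\ln(1/\delta) + n_r \ln(n_r m))/m} \Big)^{-1} \quad \text{and} \quad \|w^*\| \le \sqrt{n_r}/\lambda_r.$$ 

Further, this holds simultaneously for any vector $r' \in \mathbb{N}^d$ 
with the same support as $r$. 

Proof. Any minimum-norm minimizer for $\ell_r(w, S)$ has 
$w[i] = 0$ whenever $r[i] = 0$. Thus, we may assume w.l.o.g. 
that $r[i] > 0$ for all $i$ by deleting the coordinates with 
$r[i] = 0$. We thus have 

$$\ell_r(w, S) = w^T \hat{\Sigma}_r w - 2 w^T \hat{b} + \frac{1}{m} \mathbf{y}^T \mathbf{y}.$$ 

If $\hat{\Sigma}_r$ is positive definite, then the minimizer of 
$\ell_r(w, S)$ is $\hat{w} = \hat{\Sigma}_r^{-1} \hat{b}$. Thus 

$$\|\hat{w}\| \le \|\hat{b}\| \, \lambda_{\max}(\hat{\Sigma}_r^{-1}) = \|\hat{b}\|/\lambda_{\min}(\hat{\Sigma}_r) \le \sqrt{n_r}/\lambda_{\min}(\hat{\Sigma}_r). \qquad (7)$$ 

Now, $\hat{\Sigma}_r = \hat{\Sigma} + \hat{V}_r$. Since $\hat{V}_r \succeq 0$, we have 
$\lambda_{\min}(\hat{\Sigma}_r) \ge \lambda_{\min}(\hat{\Sigma})$. 
$\lambda_{\min}(\hat{\Sigma})$ can be bounded from below by Lemma A.4. 
Substituting Eq. (5) in Eq. (7) we get the first desired 
inequality. For the second inequality, it suffices to note 
that if $\mathrm{sub}_r(\Sigma)$ is not singular, then 
$w^* = \mathrm{sub}_r(\Sigma)^{-1} b$, thus 
$\|w^*\| \le \lambda_{\min}^{-1}(\mathrm{sub}_r(\Sigma)) \|b\| \le \sqrt{n_r}/\lambda_r$. □ 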

Combining Lemma A.3 and Lemma A.5 and applying the 
union bound we immediately get the following theorem. 

Theorem A.6. Let $S$ be a training sample of size $m$ drawn 
from $\mathcal{D}_k$, where $k[i] \ge 2$ for all $i$ in $[d]$. Fix 
$r \in \mathbb{N}^d$. Let $\hat{w}$ be a minimum-norm minimizer of 
$\ell_r(w, S)$. Let $\delta \in (0, 1)$. If 
$m \ge \tilde{\Omega}((\ln(1/\delta) + n_r \ln(n_r))/\lambda^2)$, then with 
probability at least $1 - \delta$, $\mathrm{sub}_r(\hat{\Sigma})$ is positive 
definite and 

$$|\ell(r) - \ell_r(\hat{w}, S)| \le O(\ldots).$$ 

Moreover, the positive-definiteness holds simultaneously 
for all $r'$ with the same support as $r$. 

Finally, we prove our main result, Theorem 3.1. 

Proof of Theorem 3.1. We start by proving the result for 
the full algorithm. Assume that 
$m \ge \tilde{\Omega}((\ln(1/\delta) + n_r \ln(n_r))/\lambda^2)$, and consider 
a fixed $r \in R_B$. Recall that 

$$\ell_r(w, S) = w^T \hat{\Sigma}_r w - 2 w^T Z_r \hat{b} + \frac{1}{m} \mathbf{y}^T \mathbf{y}.$$ 

The full algorithm ignores, for each examined repeat 
vector, all the matrix and vector entries that correspond to 
features with zero repeats. Thus we may assume w.l.o.g. that 
for all $i$, $r[i] > 0$. Theorem A.6 guarantees that, with 
probability $1 - \delta$, $\hat{\Sigma}$ is positive definite. 
It follows that $\hat{\Sigma}^P = \hat{\Sigma}$, therefore 
$\hat{\Sigma}^P_r = \hat{\Sigma}_r$. We thus have 

$$\ell_r(w, S) = w^T \hat{\Sigma}^P_r w - 2 w^T \hat{b} + \frac{1}{m} \mathbf{y}^T \mathbf{y}.$$ 

The minimizer of $\ell_r(w, S)$ is $\hat{w} = (\hat{\Sigma}^P_r)^{-1} \hat{b}$. 
Substituting this solution in $\ell_r(w, S)$, we get 
$\min_{w \in \mathbb{R}^d} \ell_r(w, S) = -\hat{b}^T (\hat{\Sigma}^P_r)^{-1} \hat{b} + \frac{1}{m} \mathbf{y}^T \mathbf{y} = -\mathrm{obj}_f(r) + \frac{1}{m} \mathbf{y}^T \mathbf{y}$. 
If the features are uncorrected then 

$$\min_{w \in \mathbb{R}^d} \ell_r(w, S) = -\hat{b}^T (\hat{\Sigma}^P_r)^{-1} \hat{b} + \frac{1}{m} \mathbf{y}^T \mathbf{y} = -\mathrm{obj}_f(r) + \frac{1}{m} \mathbf{y}^T \mathbf{y}.$$ 

Since all labels $Y$ are bounded in $[-1, 1]$, with probability 
$1 - \delta$, $|\frac{1}{m} \mathbf{y}^T \mathbf{y} - E[Y^2]| \le O(\sqrt{\ln(1/\delta)/m})$. 
Thus it suffices to bound 
$|\ell(r) - \min_{w \in \mathbb{R}^d} \ell_r(w, S)|$. This bound is 
given by Theorem A.6. We now apply this bound 
simultaneously to all the possible $r \in R_B$. There are 

$$\sum_{i \in [B]} (\ldots) \le \sum_{i \in [B]} (\ldots)$$ 

such combinations, thus the bound in Theorem A.6 holds 
simultaneously for all $r \in R_B$, by dividing $\delta$ with this 
upper bound. For the requirement on the size of $m$ we only 
need a union bound on the number of possible supports for 
$r$, which is bounded by $d^B$. Finally, by noting that for all 
$r \in R_B$, $n_r \le B$, we get the desired uniform convergence 
bound. 

Now, consider the Scoring algorithm. If $\Sigma$ is diagonal, 
then $\ell(w, \mathcal{D}_r)$ decomposes into a sum of $n_r$ independent 
losses over single-dimensional predictors with a 
single-dimensional covariance 'matrix' equal to the scalar 
$\sigma^2[i]$ for feature $i$. The loss minimized by the Scoring 
algorithm decomposes similarly, using the covariance 'matrix' 
$\hat{\sigma}^2[i]$. Thus, we apply the convergence bound of 
Theorem A.6 to show convergence of each of the components 
individually, with $n_r = 1$ and a union bound over $B$ possible 
values of $r[i]$, and then apply a union bound over $d$ 
components to get simultaneous convergence of the parts of the 
loss. For the requirement on the size of $m$ we only need a 
union bound over the number of components $d$. □ 

A.3. Proof of Theorem 3.2 

First, we prove the more general Lemma 3.3. We actually 
prove an equivalent mirror image of Lemma 3.3, by 
assuming the $g_i$ are convex non-increasing and proving that a 
greedy algorithm minimizes $f(r)$. We formally define the 
greedy algorithm as follows: 

1. $r_0 \leftarrow (0, \ldots, 0)$. 

2. For $t = 1$ to $B$, let $r_t = r_{t-1} + e_i$, where $i$ is the 
coordinate of $r$ that decreases $f(r)$ the most. 

3. Return $r_B$. 
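
The three steps above can be sketched in a few lines and compared against exhaustive search on a toy instance. This is only an illustration under assumptions: f is taken to be the separable sum of the $g_i(r[i])$, the feasible set is taken to be all non-negative integer vectors with total at most B, and the particular functions $g_i(x) = c_i/(x+1)$ are hypothetical examples of convex non-increasing functions.

```python
from itertools import product

def greedy_minimize(gs, budget):
    """Greedy allocation for f(r) = sum_i gs[i](r[i]) using `budget` increments."""
    r = [0] * len(gs)
    for _ in range(budget):
        # Change in f caused by incrementing each coordinate by one.
        gains = [g(r[i] + 1) - g(r[i]) for i, g in enumerate(gs)]
        i = min(range(len(gs)), key=lambda idx: gains[idx])
        r[i] += 1
    return r

def brute_force_min(gs, budget):
    """Minimum of f over all integer vectors r >= 0 with sum(r) <= budget."""
    best = None
    for r in product(range(budget + 1), repeat=len(gs)):
        if sum(r) <= budget:
            val = sum(g(ri) for g, ri in zip(gs, r))
            if best is None or val < best:
                best = val
    return best

# Hypothetical convex, non-increasing g_i(x) = c_i / (x + 1).
cs = [4.0, 1.0, 2.5]
gs = [lambda x, c=c: c / (x + 1) for c in cs]
B = 6
r_greedy = greedy_minimize(gs, B)
f_greedy = sum(g(ri) for g, ri in zip(gs, r_greedy))
f_best = brute_force_min(gs, B)
```

On this instance the greedy value `f_greedy` matches the exhaustive minimum `f_best`, as the lemma predicts for convex non-increasing $g_i$.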

Proof of Lemma 3.3. Let $r^* \in \mathrm{argmin}_{r \in R_B} f(r)$. Since 
$g_i(\cdot)$ are all non-increasing, we may choose $r^*$ such that 
$\sum_{i \in [d]} r^*[i] = B$. Let $r$ be a solution returned by the 
greedy algorithm listed in the theorem statement. 
Consider the iterations $t_1 < \ldots < t_n \in [B]$ such that the 
index $i_k$ selected by the algorithm at iteration $t_k$ satisfies 
$r_{t_k}[i_k] > r^*[i_k]$, so that it causes $r$ to increase this 
coordinate more than its value in $r^*$. Let $j_1, \ldots, j_n \in [d]$ 
be a series of alternative indices such that 

$$r = r^* + \sum_{k \in [n]} e_{i_k} - \sum_{k \in [n]} e_{j_k}.$$ 

Denote 

$$r^*_L = r^* + \sum_{k \in [L]} e_{i_k} - \sum_{k \in [L]} e_{j_k}.$$ 

Note that for all $L \le n$, $r[j_L] < r^*_{L-1}[j_L]$. 
We prove by induction that for all $L \in [n]$, 

$$f(r^*_L) = f(r^*).$$ 

When setting $L = n$ we will get $f(r) = f(r^*)$. 

The claim trivially holds for $L = 0$. Now assume it holds 
for $L - 1$, and consider $L$. Denote for brevity $t = t_L - 1$, 
$i = i_L$, and $j = j_L$. Since the algorithm selected $i$ over $j$ at 
iteration $t + 1$, we have that 

$$f(r_t + e_i) \le f(r_t + e_j).$$ 

Subtracting $\sum_{l \in [d]} g_l(r_t[l])$ from both sides we get 

$$g_i(r_t[i] + 1) - g_i(r_t[i]) \le g_j(r_t[j] + 1) - g_j(r_t[j]).$$ 

It follows that 

$$0 \le g_i(r_t[i]) - g_i(r_t[i] + 1) + g_j(r_t[j] + 1) - g_j(r_t[j]).$$ 

Since $r_t[i] \ge r^*_{L-1}[i]$, by the convexity of $g_i$, 

$$g_i(r_t[i]) - g_i(r_t[i] + 1) \le g_i(r^*_{L-1}[i]) - g_i(r^*_{L-1}[i] + 1).$$ 

In addition, $r_t[j] + 1 \le r^*_{L-1}[j]$, therefore, again by 
convexity, 

$$g_j(r_t[j] + 1) - g_j(r_t[j]) \le g_j(r^*_{L-1}[j]) - g_j(r^*_{L-1}[j] - 1).$$ 

It follows that 

$$0 \le g_i(r^*_{L-1}[i]) - g_i(r^*_{L-1}[i] + 1) + g_j(r^*_{L-1}[j]) - g_j(r^*_{L-1}[j] - 1).$$ 

Therefore 

$$f(r^*_{L-1} + e_i - e_j) \le f(r^*_{L-1}).$$ 

By the induction hypothesis $f(r^*) = f(r^*_{L-1})$. In addition, 
$r^*_{L-1} + e_i - e_j = r^*_L$. It follows that 

$$f(r^*_L) \le f(r^*).$$ 

Since $r^*$ is optimal for $f$, it follows that this inequality must 
hold with equality, thus proving the induction hypothesis. 

□ 
Proof of Theorem 3.2. We have 

$$\mathrm{obj}(r) = \sum_{i : r[i] > 0} \frac{\hat{b}[i]^2}{\hat{\sigma}^2[i] + \hat{v}[i]/r[i]}.$$ 

Define $f_i : \mathbb{N}_+ \to \mathbb{R}$ by 

$$f_i(x) = \frac{x \cdot \hat{b}[i]^2}{x \, \hat{\sigma}^2[i] + \hat{v}[i]}.$$ 

Then 

$$\mathrm{obj}(r) = \sum_{i : r[i] > 0} f_i(r[i]).$$ 

We will define $g_i(x) : \mathbb{R} \to \mathbb{R}$ so that each $g_i$ is concave 
and 

$$\sum_{i : r[i] > 0} f_i(r[i]) = \sum_{i \in [d]} g_i(r[i]). \qquad (8)$$ 

We consider two cases: (1) If $\hat{v}[i] > 0$, then $g_i(x)$ is the 
natural extension of $f_i(x)$ to the reals. (2) If $\hat{v}[i] = 0$, $f_i$ 
is a positive constant for all positive integers. Let 
$g_i(x) = f_i(1)$ for all $x \ge 1$, and let $g_i(x) = x \cdot f_i(1)$ 
for $0 \le x < 1$. 

In both cases, $g_i(x)$ is concave with $g_i(x) = f_i(x)$ for 
$x \in \mathbb{N}_+$ and $g_i(0) = 0$. Therefore Eq. (8) holds. In 
addition, in both cases $g_i(x)$ is monotonic non-decreasing. 
Therefore by Lemma 3.3, the greedy algorithm maximizes 
$\mathrm{obj}(r)$ subject to $r \in R_B$. □ 
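
The conclusion can be illustrated on a small instance by comparing greedy allocation with exhaustive search. The sketch below is an illustration under assumptions: the objective is taken to be the sum over selected features of $b[i]^2/(\sigma^2[i] + v[i]/r[i])$, the feasible set is all non-negative integer vectors with total at most B, and the numeric values (including one feature with zero internal variance, as in case (2)) are made up.

```python
from itertools import product

def obj(r, b, s2, v):
    """Scoring objective: sum over features with r[i] > 0."""
    return sum(b[i] ** 2 / (s2[i] + v[i] / r[i])
               for i in range(len(r)) if r[i] > 0)

def greedy_max(b, s2, v, budget):
    """Add one judgment at a time to the feature that raises obj the most."""
    r = [0] * len(b)
    for _ in range(budget):
        base = obj(r, b, s2, v)

        def gain(i):
            trial = r[:i] + [r[i] + 1] + r[i + 1:]
            return obj(trial, b, s2, v) - base

        r[max(range(len(b)), key=gain)] += 1
    return r

# Hypothetical estimates; feature 1 has zero internal variance.
b = [0.8, 0.5, 0.3]
s2 = [1.0, 0.5, 0.2]
v = [2.0, 0.0, 0.4]
B = 5

r_greedy = greedy_max(b, s2, v, B)
best = max(obj(list(r), b, s2, v)
           for r in product(range(B + 1), repeat=len(b)) if sum(r) <= B)
```

On this instance `obj(r_greedy, b, s2, v)` equals the exhaustive maximum `best` up to floating-point error.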



